This workshop walks through the assignment for the accompanying course, Linear Regression and a Space Scam, focusing on the background behind linear regression and performing hypothesis tests manually.

The objective was to use linear regression to fit the following two equations to our data, and then determine which of them better describes the data. First, we assume only one radioactive isotope is present, in which case our equation is $\Delta N(t)=A r_1 e^{-r_1 t}+C$, and then we test a second form that includes a second isotope, $\Delta N(t)=A r_1 e^{-r_1 t}+B r_2 e^{-r_2 t}+C$. Here, $A$ and $B$ represent the relative amounts of each isotope present in the initial fuel sample, and $C$ is the level of background radiation. The decay rates $r_1$ and $r_2$ are 0.4 and 0.1, respectively.


In [1]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline

The following cell sets some default styling for our plots, based both on personal preference and on the fact that larger plots are easier to see in screen-shares. Additionally, I'm suppressing some warnings that arise because I'm using the latest version of Python.


In [2]:
matplotlib.rcParams.update({'font.size': 22,
                            'font.family': 'serif',
                            'figure.figsize': (12,8),
                            'text.usetex': True,
                            'lines.linewidth': 2})
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)

It will be useful to define our constants and a function representing the decay of an isotope with decay rate $r$ at time $t$.


In [3]:
r1, r2 = 0.4, 0.1
decay = lambda t, r: r*np.exp(-r*t)

The data are hosted in GitHub and can be directly downloaded with pandas. Additionally, I'm going to extract the columns into individual NumPy arrays for brevity and manual manipulation later.


In [4]:
d = pd.read_csv('https://github.com/joetrollo/thinkful-library/'
                'raw/master/linear-regression/engine.csv')
t, N = d.to_numpy().T

Before we begin our analysis, it's worth the few minutes to plot the data and see what we're dealing with.


In [5]:
plt.scatter(t, N, color='r', edgecolor='', s=50)
plt.xlabel('$t$')
plt.ylabel(r'$\Delta N$')
plt.show()


Some introductions to linear regression don't make it clear that, even though the above plot is curved, the model still falls within the realm of "linear". Let's review a basic example of linear regression and then generalize it for more applications. Consider the following plot of randomly generated data, following the equation $y=mx+b+\varepsilon$:


In [6]:
n = 50 # number of data points
m = 0.5 # slope
b = -1 # intercept
e = 0.5 # intensity of noise
# We'll generate n x-values
x = np.random.uniform(-2, 8, size=n)
# Then the corresponding y-values
y = m*x + b
# And add random noise to the output
y += np.random.normal(scale=e, size=n)
x2 = np.r_[-2, 8]

plt.plot(x2, m*x2 + b, 'k-', zorder=-10)
plt.gca().set_aspect('equal')
plt.scatter(x, y, c='r', edgecolor='', s=50)
plt.xlim(-3, 9)
plt.ylim(-3, 4)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()


The particular form of linear regression we'll be using here is "ordinary least squares" (OLS), in which we attempt to find estimates of the parameters $m$ and $b$, denoted by $\hat m$ and $\hat b$, that minimize the total squared distance between the model output $\hat y$ and the observed output $y$. This sum of squared residuals (SSR) can be written as

$$ \mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} (y_i - \hat m x_i - \hat b)^2 $$
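To make this concrete, here is a minimal sketch (the helper name ssr is mine, and it reuses the x, y, m, and b from the cell above) that evaluates the SSR at the true parameters and at a perturbed pair. The OLS estimates are, by definition, the pair that makes this quantity as small as possible.

ssr = lambda m_hat, b_hat: np.sum((y - m_hat*x - b_hat)**2)
# The true parameters should give a much smaller SSR than a perturbed pair;
# the OLS estimates will make it smaller still.
print(ssr(m, b), ssr(m + 0.2, b - 0.2))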

Note that this method minimizes the vertical distance between the model and the points, not the orthogonal distance (which is called "total least squares"). The assumption we make when using OLS is that the noise $\varepsilon$ affects the model's output only—not the input variables. You can compare the two methods in the following plots:


In [7]:
plt.figure(figsize=(12, 8))

axs = [plt.subplot(121), plt.subplot(122)]
for ax in axs:
    ax.set_aspect('equal')
    ax.scatter(x, y, c='r', edgecolor='', zorder=10)
    ax.set_xlim(-3, 9)
    ax.set_ylim(-3, 4)
    ax.plot([-3, 9], [-3*m+b, 9*m+b], 'k-')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')

axs[0].set_title('Ordinary least squares')
axs[1].set_title('Total least squares')

for i, j in zip(x, y):
    axs[0].plot([i, i], [j, m*i+b], 'k-', lw=1)
for i, j in zip(x, y):
    axs[1].plot([i, (i+m*j-m*b)/(m*m + 1)],
                [j, m*(i +m*j-m*b)/(m*m+1) + b], 'k-', lw=1)
plt.show()


The previous equation for the SSR can be solved for $\hat m$ and $\hat b$ in that expanded form, but it becomes cumbersome when we include additional variables. Instead, we'll represent this equation in matrix form. We'll put our $y_i$ into a column,

$$ Y = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] $$

and our inputs $x_i$ into a matrix with a column representing our constant term,

$$ X = \left[ \begin{array}{cc} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{array} \right]. $$

This matrix of input $X$ is called the design matrix for our model. Finally, our parameters $m$ and $b$ will go into a matrix of their own,

$$ \beta = \left[ \begin{array}{c} m \\ b \end{array} \right]. $$

Notice that the product of $X$ and $\beta$ gives us

$$ X \cdot \beta = \left[ \begin{array}{cc} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{array} \right] \cdot \left[ \begin{array}{c} m \\ b \end{array} \right] = \left[ \begin{array}{c} m x_1 + b \\ m x_2 + b \\ \vdots \\ m x_n + b \end{array} \right], $$

and so our model can be represented in a much shorter equation, $Y = X\beta + \varepsilon$, regardless of how many input variables we have. Now, we're aiming to find the parameters $\hat \beta$ that minimize the SSR, whose new form is

$$ \mathrm{SSR} = (Y-X\hat\beta)^T(Y-X\hat\beta) $$

where $T$ represents the matrix transpose. If we differentiate the equation with respect to $\hat\beta$ and set it equal to zero, we find that the SSR is at a minimum when $\hat\beta = (X^TX)^{-1}X^TY$. Let's calculate this value for our example above and compare it to the true values. This code is written in Python 3.5, which introduced a dedicated operator for matrix multiplication, @. In case you're using an older version of Python, all the following statements are equivalent:

C = A @ B
C = A.dot(B)
C = np.dot(A, B)

Also note that np.c_[] is used here as shorthand for np.column_stack(). See also the documentation for np.r_[], which has better examples.
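For example, both of the following build the same two-column array (a small illustrative snippet, not part of the original analysis):

a = np.array([1., 2., 3.])
# Both expressions produce a 3-by-2 array: a in the first column, ones in the second.
np.c_[a, np.ones_like(a)]
np.column_stack([a, np.ones_like(a)])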


In [8]:
# Create the design matrix
X = np.c_[x, np.ones_like(x)]
# Evaluate the matrix equation
print(np.linalg.inv(X.T @ X) @ X.T @ y)


[ 0.54205454 -1.09198988]

These values are in close agreement with what was originally used to generate the data. Going forward, it's preferable to use np.linalg.solve(A, B) in place of np.linalg.inv(A) @ B, as it is more numerically stable.
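As a quick illustration (reusing the X and y from the example above), both approaches solve the same normal equations and should agree to within floating-point error. NumPy also provides np.linalg.lstsq, which solves the least-squares problem directly without forming $X^TX$ at all.

# Both compute the OLS estimates; solve() avoids forming an explicit inverse.
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y
b_solve = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(b_inv, b_solve))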

Let's now consider our primary problem of modeling $\Delta N(t)=A r_1 e^{-r_1 t}+C$. Because we are only interested in estimating $A$ and $C$, we can isolate $r_1 e^{-r_1 t}$ and turn it into a single variable instead of a combination of variables and constants. Our design matrix and parameter matrix now become

$$ X \cdot \beta = \left[ \begin{array}{cc} r_1 e^{-r_1 t_1} & 1 \\ r_1 e^{-r_1 t_2} & 1 \\ \vdots & \vdots \\ r_1 e^{-r_1 t_n} & 1 \end{array} \right] \cdot \left[ \begin{array}{c} A \\ C \end{array} \right] $$

With our matrix equations, we can quickly calculate the parameters for this first model:


In [9]:
X = np.c_[decay(t, r1), np.ones_like(t)]
B = np.linalg.solve(X.T @ X, X.T @ N)
print(B)


[ 0.94493838  0.05543361]

Instead of our first plot, which showed $\Delta N$ against $t$, let's plot $\Delta N$ against the first column of the design matrix, along with a line using the parameters we just calculated. Remember that $\hat Y = X \hat B$, which in code is Y = X @ B.


In [10]:
plt.scatter(X[:,0], N, color='r', edgecolor='', s=50)
plt.plot(X[:,0], X @ B, 'k-')
plt.xlabel('$r_1 e^{-r_1 t}$')
plt.ylabel(r'$\Delta N$')
plt.show()


The transformed data also resemble a line, matching our intuition of what it means to be "linear" in the context of linear regression. The power of the matrix equation we used becomes evident when we attempt to repeat the process for the second model, $\Delta N(t)=A r_1 e^{-r_1 t}+B r_2 e^{-r_2 t}+C$. We just create the new design matrix and use the same equation:


In [11]:
X = np.c_[decay(t, r1), decay(t, r2), np.ones_like(t)]
B = np.linalg.solve(X.T @ X, X.T @ N)
print(B)


[ 0.80102059  0.1944868   0.04991807]

Let's visualize the data again, this time using two columns and plotting in three-dimensional space, adding in a plane defined by the parameters we just found:


In [12]:
from mpl_toolkits.mplot3d import Axes3D

x = np.linspace(-0.01, 0.06)
y = np.linspace(0.02, 0.07)
x, y = np.meshgrid(x, y)
z = B[0]*x + B[1]*y + B[2]

for r in np.arange(45, 180, 45):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X[:,0], X[:,1], N, c='r', edgecolor='')
    ax.plot_surface(x, y, z, color='none')
    ax.azim -= r
    ax.set_xlabel('$r_1 e^{-r_1 t}$')
    ax.set_ylabel('$r_2 e^{-r_2 t}$')
    ax.set_zlabel(r'$\Delta N$')
    plt.show()


In higher dimensions, the data now lie on a plane. When performing linear regression, it's important to know that "linear" refers to our model being linear in the parameters, and does not impose any constraints on the variables we use. Mathematically, the matrix equations are all about lines and (hyper)planes, but the variables can be transformed any way we want before building the design matrix.
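As a small illustrative sketch (not part of the assignment), fitting a quadratic is still linear regression; the design matrix simply contains a transformed column. The variable names here are hypothetical.

# Fit v = a*u^2 + b*u + c by OLS on synthetic data.
u = np.random.uniform(-1, 1, size=100)
v = 2*u**2 - u + 0.5 + np.random.normal(scale=0.1, size=100)
Xq = np.c_[u**2, u, np.ones_like(u)]  # columns: u^2, u, constant
print(np.linalg.solve(Xq.T @ Xq, Xq.T @ v))  # estimates of (a, b, c)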

While these plots and equations have hopefully helped you to understand how linear regression works, you will likely never need to work with the raw matrices. Now let's run the same analysis using statsmodels. In particular, we'll be using the API from the statsmodels.formula module, which allows us to describe models as string equations. If we were using statsmodels.api directly, we'd have to create the design matrix manually.

The basic form for the formula API equations is simply y ~ x1 + x2 + ... + xn. Here, y is the output we wish to model and the xi are input variables. However, the API lets us use a pandas.DataFrame as input and reference variables by the column name. We can also run the variables through other functions, all in a single string. Note that the intercept is implicit and does not need to be included in your input equation. You can find more examples in the statsmodels documentation.
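For instance, a model like $y = a\log(x_1) + b x_2 + c$ can be written as 'y ~ np.log(x1) + x2'. Here is a small self-contained sketch on a toy DataFrame (the column names and coefficients are made up for illustration):

from statsmodels.formula import api as smf

toy = pd.DataFrame({'x1': np.random.uniform(1, 10, 100),
                    'x2': np.random.uniform(0, 1, 100)})
toy['y'] = 2*np.log(toy['x1']) - toy['x2'] + np.random.normal(scale=0.1, size=100)
# The transformation is applied inside the formula string; the intercept is implicit.
print(smf.ols('y ~ np.log(x1) + x2', data=toy).fit().params)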


In [13]:
from statsmodels.formula import api as smf
m1 = smf.ols('N ~ decay(t, r1)', data=d).fit()
print(m1.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      N   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                     5413.
Date:                Wed, 11 Nov 2015   Prob (F-statistic):           7.97e-52
Time:                        16:28:24   Log-Likelihood:                 267.14
No. Observations:                  51   AIC:                            -530.3
Df Residuals:                      49   BIC:                            -526.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        0.0554      0.000    218.969      0.000         0.055     0.056
decay(t, r1)     0.9449      0.013     73.573      0.000         0.919     0.971
==============================================================================
Omnibus:                        1.196   Durbin-Watson:                   1.259
Prob(Omnibus):                  0.550   Jarque-Bera (JB):                0.579
Skew:                          -0.225   Prob(JB):                        0.749
Kurtosis:                       3.263   Cond. No.                         70.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This seems like a great fit, with a high $R^2$ and small $p$-values for the coefficients. Remember, these $p$-values test the null hypothesis that a coefficient is equal to 0, so a low $p$-value means that, if the true coefficient were zero, an estimate this far from zero would be very unlikely to arise by chance.

Let's plot the fit and inspect it visually.


In [14]:
plt.scatter(t, N, color='r', edgecolor='', s=50)
plt.xlabel('$t$')
plt.ylabel(r'$\Delta N$')
plt.plot(t, m1.predict(), 'k-')
plt.show()


This seems like an excellent fit, and if we didn't suspect that there might be an extra term in the model, we might have stopped here. Let's create the second model and see how it compares.


In [15]:
m2 = smf.ols('N ~ decay(t, r1) + decay(t, r2)', data=d).fit()
print(m2.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      N   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                     3885.
Date:                Wed, 11 Nov 2015   Prob (F-statistic):           8.26e-54
Time:                        16:28:24   Log-Likelihood:                 276.81
No. Observations:                  51   AIC:                            -547.6
Df Residuals:                      48   BIC:                            -541.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        0.0499      0.001     41.896      0.000         0.048     0.052
decay(t, r1)     0.8010      0.032     24.705      0.000         0.736     0.866
decay(t, r2)     0.1945      0.041      4.704      0.000         0.111     0.278
==============================================================================
Omnibus:                        0.451   Durbin-Watson:                   1.841
Prob(Omnibus):                  0.798   Jarque-Bera (JB):                0.450
Skew:                           0.208   Prob(JB):                        0.798
Kurtosis:                       2.801   Cond. No.                         338.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This second model also has a high coefficient of determination, though not much higher than the first. Let's compare them visually:


In [16]:
plt.scatter(t, N, color='r', edgecolor='', s=50)
plt.xlabel('$t$')
plt.ylabel(r'$\Delta N$')
plt.plot(t, m1.predict(), 'k--', lw=1, label='One isotope')
plt.plot(t, m2.predict(), 'k-', label='Two isotopes')
plt.legend()
plt.show()


The second model looks slightly better, but this slight visual difference isn't sufficient justification for choosing one over the other. The question is now: how can we objectively assess whether the two-isotope model is a better description of the data?

There are multiple ways to answer this question, but for this exercise we're going to use the likelihood-ratio test. This test tells us how much more likely the data are under the new (alternative) model than under the first (null) model, provided the models are nested. (A pair of models is nested if one model is a special case of the other, e.g., our first model is a special case of the second model with $B=0$.) The test statistic that we need to perform this inference is

$$ D = -2 \ln \left( \frac{\mathcal L}{\mathcal L ^ \prime} \right) = 2 (\ln\mathcal L^\prime - \ln\mathcal L) $$

where $\mathcal L$ and $\mathcal L^\prime$ are the likelihood of the null and alternative models respectively. One of the important assumptions of linear regression is that the noise on $Y$ is normally distributed with a mean of 0. Based on the likelihood function of the normal distribution and the data in our model, statsmodels provides us with the log-likelihoods in the llf attribute of the model, which we can then use to calculate $D$.
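As an aside, the llf value can be reproduced by hand from the sum of squared residuals using the standard Gaussian log-likelihood for OLS (a sketch for verification, not something the assignment requires):

# For OLS with normal errors, the maximized log-likelihood is
# -n/2 * (log(2*pi) + log(SSR/n) + 1), where SSR is the sum of squared residuals.
n_obs = len(N)
ll = -n_obs/2 * (np.log(2*np.pi) + np.log(m1.ssr/n_obs) + 1)
print(np.allclose(ll, m1.llf))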


In [17]:
D = 2 * (m2.llf - m1.llf)
print(D)


19.3346630104

On its own, this value doesn't mean much to us. However, because the two models are nested, we know this test statistic will follow a chi-squared distribution with the degrees of freedom equal to the difference in the degrees of freedom of the two models. In most cases, this difference is just equal to the difference in the number of parameters of the models. In our case, the statistic $D$ would follow a chi-squared distribution with one degree of freedom.
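If you want to convince yourself of this, here is a rough simulation sketch (purely illustrative; the coefficients and noise level are arbitrary). Data are generated from the one-isotope (null) model, both models are fit, and the resulting $D$ values should approximately follow a chi-squared distribution with one degree of freedom.

from scipy import stats

rng = np.random.RandomState(0)
D_sim = []
for _ in range(200):
    fake = pd.DataFrame({'t': t})
    fake['N'] = 0.9*decay(t, r1) + 0.05 + rng.normal(scale=0.002, size=len(t))
    f1 = smf.ols('N ~ decay(t, r1)', data=fake).fit()
    f2 = smf.ols('N ~ decay(t, r1) + decay(t, r2)', data=fake).fit()
    D_sim.append(2*(f2.llf - f1.llf))
# The empirical 95th percentile should be close to the chi-squared critical value.
print(np.percentile(D_sim, 95), stats.chi2.ppf(0.95, 1))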

In the context of the chi-squared distribution, we can ask ourselves: what's the probability of attaining a value of $D$ or higher by chance? While many statistical tests in SciPy automatically provide the associated $p$-values, the likelihood-ratio test is not part of the package, so we'll have to calculate this ourselves. Let's begin by plotting the distribution, and letting $D=1$ for illustration purposes.


In [18]:
from scipy import stats

D = 1
dof = 1 # degrees of freedom
x = np.linspace(0, 2, 200)
y = stats.chi2.pdf(x, dof)

plt.plot(x, y, 'k-')
plt.vlines(D, 0, stats.chi2.pdf(D, dof), color='r')
plt.ylim(0, 3)
plt.xlabel('$x$')
plt.ylabel('$f_1(x)$')
plt.show()


The probability of attaining a value of $D$ or higher is equal to the integral of the distribution from $D$ to infinity, i.e.,

$$ p = \int_D^\infty f_1(x)dx $$

which visually corresponds to the hatched area in this next plot:


In [19]:
x2 = np.linspace(D, 2)
y2 = stats.chi2.pdf(x2, 1)

plt.plot(x, y, 'k-')
plt.vlines(D, 0, stats.chi2.pdf(D, 1), color='r')
plt.fill_between(x2, y2, facecolor='', hatch='//', edgecolor='r')
plt.ylim(0, 3)
plt.xlabel('$x$')
plt.ylabel('$f_1(x)$')
plt.show()


This integral is called the survival function, and is conveniently provided by the sf() method of the distribution. (The survival function is just the complement of the cumulative distribution function $CDF(x)$, i.e., $SF(x) = 1 - CDF(x)$.) Restoring the former value of $D$, we can easily find the associated $p$-value:


In [20]:
D = 2*(m2.llf - m1.llf)
p = stats.chi2.sf(D, dof)
print(p)


1.09696777676e-05
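As a quick check (illustrative only), this value agrees with both the complement of the CDF and a direct numerical integration of the density from $D$ to infinity:

from scipy import integrate

print(stats.chi2.sf(D, dof))
print(1 - stats.chi2.cdf(D, dof))
print(integrate.quad(lambda u: stats.chi2.pdf(u, dof), D, np.inf)[0])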

The likelihood-ratio test evaluates the null hypothesis that the data we observe are no more likely under the second model than under the first. This $p$-value tells us that, if the one-isotope model were correct, an improvement in likelihood this large would be very unlikely to arise by chance. We can therefore reject the null hypothesis and conclude that there are two isotopes in the fuel sample.

The remainder of the exercise requests that we determine what percentage of the initial fuel is contaminated, by calculating the ratio $P=B/(A+B)$, and also determine the associated uncertainty for this percentage. The parameters of the model and their associated uncertainties are available in the params and bse attributes, respectively, so we can extract them to perform our final calculation. The formula for the final uncertainty can be found by propagating uncertainty through the equation $P=B/(A+B)$, combining the second and fifth equations in this table.


In [21]:
C, A, B = m2.params
# The variance is the square of the standard error
vC, vA, vB = m2.bse**2
P = B/(A+B)
vP = P**2 * ( vA/(A**2) + (vA+vB)/((A+B)**2) )
print(P, np.sqrt(vP))


0.195364492818 0.0129946252408

While the estimated percentage of contaminated fuel, 19.5%, is less than the 20% upper limit, given the uncertainty of 1.3% it would not be unreasonable for the true contamination to have exceeded 20%.
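As an optional cross-check (a sketch of an alternative approach, not part of the original assignment), the uncertainty could also be propagated with the delta method, using the full parameter covariance matrix from cov_params() so that the correlation between $\hat A$ and $\hat B$ is taken into account:

# Gradient of P = B/(A+B) with respect to (C, A, B); P does not depend on C.
g = np.array([0, -B/(A+B)**2, A/(A+B)**2])
# First-order (delta-method) standard error using the estimated covariance matrix.
cov = np.asarray(m2.cov_params())
print(np.sqrt(g @ cov @ g))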